R is a programming language designed originally for performing statistical analyses. However, it’s potential has been increased and nowadays it can be used for general data analysis, creating atractive plots or even for creating websites! You can download R from the official repository: CRAN.
If you have used spreadsheets to calculate stuff, you have already programmed.
In R, you can assign values to variables using <-. You don’t have to type it everytime, just Alt+-!
variable <- 2342
variable
## [1] 2342
You can use R from the unix command line or download an Integrated Development Environment (IDE) such as RStudio.
RStudio overview. Source: http://www.sthda.com/english/wiki/r-basics-quick-and-easy
Rscripts. Write your code and run it whenever you want using Ctrl + Enter.The basic types of variables that can be represented in R are the following:
332, 0.234."Barcelona", "woman". A special type of character is the class factor.TRUE or FALSE.You can check the type of variable using class().
class() function.
var <- "text" # character
class(var)
## [1] "character"
var <- 3242 # numeric
class(var)
## [1] "numeric"
var <- TRUE # logical
class(var)
## [1] "logical"
| Operator | Description |
|---|---|
+ |
addition |
- |
subtraction |
* |
multiplication |
/ |
division |
^ or ** |
exponentiation |
3+4*5.2/2^2
## [1] 8.2
| Operator | Description |
|---|---|
> |
greater than |
>= |
greater than or equal to |
== |
exactly equal to |
!= |
not equal to |
# Works with numerics
3 > 2
## [1] TRUE
# And also with characters
"class"=="class"
## [1] TRUE
We can combine several data types into more complex structures. The different structures one can create in R are the following:
c() of values. The type of all values should be the same.vector <- c(0,1,2,3,4,6)
vector
## [1] 0 1 2 3 4 6
matrix(). You can check which arguments you can pass to the matrix function typing ?matrix or help(matrix)mat <- matrix(vector, nrow=2)
mat
## [,1] [,2] [,3]
## [1,] 0 2 4
## [2,] 1 3 6
vector <- c('a', 'b', 'c', 'd', 'e', 'f')
mat <- matrix(vector, ncol=2)
mat
## [,1] [,2]
## [1,] "a" "d"
## [2,] "b" "e"
## [3,] "c" "f"
patients <- data.frame(patientID=1:4,
gender=c("male", "female", "male", "female"),
age=c(23, 45, 55, 22),
dead=c(F, T, T, F))
patients
## patientID gender age dead
## 1 1 male 23 FALSE
## 2 2 female 45 TRUE
## 3 3 male 55 TRUE
## 4 4 female 22 FALSE
patients$age # to select one column
## [1] 23 45 55 22
listName[[1]].list <- list(name="donors",
patientList=patients)
list
## $name
## [1] "donors"
##
## $patientList
## patientID gender age dead
## 1 1 male 23 FALSE
## 2 2 female 45 TRUE
## 3 3 male 55 TRUE
## 4 4 female 22 FALSE
list[[1]]
## [1] "donors"
list[["name"]]
## [1] "donors"
There are many basic functions that are already loaded into R. For example, data.frame() is a function that generates data.frames and help() is a function that loads the manual for a specific R function.
You can create you own functions in R as follows:
sq <- function(x) {
square <- x*x
return(square)
}
sq(2)
## [1] 4
Functions are approppriate when you are copy&pasting operations many times, but changing tiny details. It might be more time consuming to create a function, but then you an run it with the parameters you want without any additional effort.
Many times, there are functions out there for the things you want to do. This functions are wrapped in packages and you can download them and use them freely!
Packages are the fundamental units of reproducible R code. They include reusable R functions, the documentation that describes how to use them, and sample data. You can find R packages at official repositories like CRAN or in personal repositories like github. Packages found at official repositories are more consistent than those found at personal repositories, which could be still in development.
There are many packages that are already preinstalled in R. You can load those packages which will allow you to access their functions using library(NameOfPackage). Once a package is loaded with library, all its function will be available to you untill you close your R session (i.e. close RStudio).
NameOfPackage::name_of_function(arguments) so you don’t have to load all the package!
The most important R repositories from which you can donwload packages are:
CRAN. Is the official R repository and contains packages for many different purposes. To install packages from CRAN, you just need to type in your R console installPackages("NameOfPackage").
Bioconductor. This repository contains packages dedicated to the analysis and comprehension of high-throughput genomic data. To install packages from Bioconductor, you will first need to install the CRAN package BiocManager, if it is not already installed (install.packages("BiocManager")). If it is installed, you can use the function install from that package with the name of the Bioconductor package you want to install: BioManager::install("NameOfPackage").
The ability to efficiently represent and manipulate genomic annotations and alignments is playing a central role when it comes to analyzing high-throughput sequencing data (a.k.a. NGS data). The GenomicRanges package defines general purpose containers for storing and manipulating genomic intervals and variables defined along a genome.
To generate a GenomicRanges object, we just need to provide the following parameters:
seqnames. Name of the chromosome (for example chr1).ranges. Range of the region you are trying to define (start and end). If your range only has one bp (a SNP, for example) you can include only the start position.strand. If you do not specify any strand (+ or -), it will default to *.mcols. You can include a data.frame with additional information that will be appended to your ranges.Below you can see how to create a GRanges object with only one range:
library(GenomicRanges)
gr <- GRanges(seqnames="chr1",
ranges=IRanges(start=3, end=60))
gr
## GRanges object with 1 range and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] chr1 3-60 *
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
Or you can create and object with multiple ranges at once:
gr <- GRanges(seqnames=c("chr1", "chr1", "chr3"),
ranges=IRanges(start=c(2,65,23),
end=c(43, 192, 60)),
mcols=data.frame(id=c("region1", "region2", "region3")))
gr
## GRanges object with 3 ranges and 1 metadata column:
## seqnames ranges strand | mcols.id
## <Rle> <IRanges> <Rle> | <factor>
## [1] chr1 2-43 * | region1
## [2] chr1 65-192 * | region2
## [3] chr3 23-60 * | region3
## -------
## seqinfo: 2 sequences from an unspecified genome; no seqlengths
You can access the different elements in your GRanges object with different functions:
seqnames(gr) # Returns all chr names
## factor-Rle of length 3 with 2 runs
## Lengths: 2 1
## Values : chr1 chr3
## Levels(2): chr1 chr3
start(gr) # Returns start position, same with end(gr)
## [1] 2 65 23
width(gr)
## [1] 42 128 38
mcols(gr) # Returns mcols data.frame
## DataFrame with 3 rows and 1 column
## mcols.id
## <factor>
## 1 region1
## 2 region2
## 3 region3
gr$mcols.id # Returns a specific mcols data.frame
## [1] region1 region2 region3
## Levels: region1 region2 region3
length(granges) # Number of regions in your GRanges object
## [1] 1
You can load and convert to GRanges the peak files generated from MACS2. The easiest way is using the function toGRanges from the package regioneR. You just need to provide the path to your peak file and it will load it and convert it to a GRanges object. Since we only need one specific function from the GRanges package, we are going to use regioneR::toGRanges.
peaks <- regioneR::toGRanges("data/ChIP-seq/peaks/NL1_h3k27ac_peaks.broadPeak")
peaks
## GRanges object with 39617 ranges and 5 metadata columns:
## seqnames ranges strand |
## <Rle> <IRanges> <Rle> |
## [1] chr1 9923-10265 * |
## [2] chr1 19511-19710 * |
## [3] chr1 28763-29492 * |
## [4] chr1 96299-98434 * |
## [5] chr1 101192-105777 * |
## ... ... ... ... .
## [39613] chrY 58887301-58888313 * |
## [39614] chrY 58905991-58906437 * |
## [39615] chrY 58912083-58912805 * |
## [39616] chrY 58914696-58915173 * |
## [39617] chrY 59362978-59363492 * |
## name score signalValue
## <character> <numeric> <numeric>
## [1] data/ChIP-seq/peaks/NL1_h3k27ac_peak_1 13 3.08409
## [2] data/ChIP-seq/peaks/NL1_h3k27ac_peak_2 14 3.29854
## [3] data/ChIP-seq/peaks/NL1_h3k27ac_peak_3 44 5.29025
## [4] data/ChIP-seq/peaks/NL1_h3k27ac_peak_4 29 4.38307
## [5] data/ChIP-seq/peaks/NL1_h3k27ac_peak_5 26 4.17705
## ... ... ... ...
## [39613] data/ChIP-seq/peaks/NL1_h3k27ac_peak_39613 15 1.75974
## [39614] data/ChIP-seq/peaks/NL1_h3k27ac_peak_39614 21 1.99217
## [39615] data/ChIP-seq/peaks/NL1_h3k27ac_peak_39615 15 1.7828
## [39616] data/ChIP-seq/peaks/NL1_h3k27ac_peak_39616 12 1.81376
## [39617] data/ChIP-seq/peaks/NL1_h3k27ac_peak_39617 40 3.52012
## pValue qValue
## <numeric> <numeric>
## [1] 3.52954 1.38501
## [2] 3.70218 1.41399
## [3] 7.29983 4.42057
## [4] 5.62744 2.99849
## [5] 5.2358 2.65918
## ... ... ...
## [39613] 3.78496 1.563
## [39614] 4.53431 2.11786
## [39615] 3.73717 1.5316
## [39616] 3.39216 1.28711
## [39617] 6.93227 4.05808
## -------
## seqinfo: 51 sequences from an unspecified genome; no seqlengths
There are many packages for detecting differential sites for ChIP-seq data. One of the most popular is DESeq2.
For detecting differential sites with DESeq2 you need the following:
matrix in which each row will be a genomic region and each column a sample. The values will be the number of reads found in each region, for each sample.data.frame including phenotyping or grouping information for each of the samples found in the columns of the count matrix.The comparison we are going to do is normal tissue (NL) vs. hepatocellular carcinoma (HCC). Therefore, you will need to download the following files from ~/workshop_ChIPseq/ChIPseq/peaks/:
NL1_h3k27ac_peaks.broadPeakHCC1_h3k27ac_peaks.broadPeakHCC3_h3k27ac_peaks.broadPeakNext, we will load this peak files into R and then merge the overlapping genomic regions to have a single peak dataset with all the H3K27ac enriched regions.
# First, we load the peaks into R
files <- paste0("data/ChIP-seq/peaks/",
c("NL1_h3k27ac_peaks.broadPeak", "HCC1_h3k27ac_peaks.broadPeak", "HCC3_h3k27ac_peaks.broadPeak"))
peak.list <- lapply(files, read.delim, header=F) # read peaks in the 3 input files
peaks <- do.call(rbind, peak.list) # convert 3 data.frames in 1 data.frame (row-bind)
head(peaks)
## V1 V2 V3 V4 V5 V6 V7
## 1 chr1 9922 10265 data/ChIP-seq/peaks/NL1_h3k27ac_peak_1 13 . 3.08409
## 2 chr1 19510 19710 data/ChIP-seq/peaks/NL1_h3k27ac_peak_2 14 . 3.29854
## 3 chr1 28762 29492 data/ChIP-seq/peaks/NL1_h3k27ac_peak_3 44 . 5.29025
## 4 chr1 96298 98434 data/ChIP-seq/peaks/NL1_h3k27ac_peak_4 29 . 4.38307
## 5 chr1 101191 105777 data/ChIP-seq/peaks/NL1_h3k27ac_peak_5 26 . 4.17705
## 6 chr1 138376 139748 data/ChIP-seq/peaks/NL1_h3k27ac_peak_6 16 . 3.49893
## V8 V9
## 1 3.52954 1.38501
## 2 3.70218 1.41399
## 3 7.29983 4.42057
## 4 5.62744 2.99849
## 5 5.23580 2.65918
## 6 4.04034 1.68627
# Convert data.frame to GenomicRanges
library(GenomicRanges)
peaks.gr <- GRanges(seqnames=peaks$V1,
ranges=IRanges(start=peaks$V2,
end=peaks$V3))
peaks.gr
## GRanges object with 108571 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] chr1 9922-10265 *
## [2] chr1 19510-19710 *
## [3] chr1 28762-29492 *
## [4] chr1 96298-98434 *
## [5] chr1 101191-105777 *
## ... ... ... ...
## [108567] chrY 13868263-13868575 *
## [108568] chrY 26299001-26299201 *
## [108569] chrY 26467450-26467670 *
## [108570] chrY 27663117-27663383 *
## [108571] chrY 58865921-58866234 *
## -------
## seqinfo: 55 sequences from an unspecified genome; no seqlengths
# Keep only autosomal chr
peaks.gr <- peaks.gr[seqnames(peaks.gr) %in% paste0("chr", 1:22),]
# Merge overlapping ranges
peaks.m <- reduce(peaks.gr)
peaks.m
## GRanges object with 59158 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] chr1 9922-10265 *
## [2] chr1 19510-19710 *
## [3] chr1 27798-29492 *
## [4] chr1 32747-33314 *
## [5] chr1 95547-105777 *
## ... ... ... ...
## [59154] chr9 141121885-141125359 *
## [59155] chr9 141127738-141128268 *
## [59156] chr9 141129218-141130147 *
## [59157] chr9 141131195-141131597 *
## [59158] chr9 141146940-141147140 *
## -------
## seqinfo: 55 sequences from an unspecified genome; no seqlengths
# Convert GRanges to SAF (needed for Rsubread::featureCounts)
saf <- data.frame(peaks.m)
head(saf)
## seqnames start end width strand
## 1 chr1 9922 10265 344 *
## 2 chr1 19510 19710 201 *
## 3 chr1 27798 29492 1695 *
## 4 chr1 32747 33314 568 *
## 5 chr1 95547 105777 10231 *
## 6 chr1 113560 114424 865 *
saf <- saf[,-4]
colnames(saf) <- c("Chr", "Start", "End", "Strand")
saf$Strand <- "+"
saf$GeneID <- paste0("region_", 1:nrow(saf))
# Get read counts in each region from BAM files
bam.files <- c("data/ChIP-seq/BAM/NL1_h3k27ac.rmdup.bam",
"data/ChIP-seq/BAM/HCC1_h3k27ac.rmdup.bam",
"data/ChIP-seq/BAM/HCC3_h3k27ac.rmdup.bam")
reads <- Rsubread::featureCounts(bam.files,
annot.ext=saf,
nthreads=10)
##
## ========== _____ _ _ ____ _____ ______ _____
## ===== / ____| | | | _ \| __ \| ____| /\ | __ \
## ===== | (___ | | | | |_) | |__) | |__ / \ | | | |
## ==== \___ \| | | | _ <| _ /| __| / /\ \ | | | |
## ==== ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
## ========== |_____/ \____/|____/|_| \_\______/_/ \_\_____/
## Rsubread 1.32.2
##
## //========================== featureCounts setting ===========================\\
## || ||
## || Input files : 3 BAM files ||
## || S NL1_h3k27ac.rmdup.bam ||
## || S HCC1_h3k27ac.rmdup.bam ||
## || S HCC3_h3k27ac.rmdup.bam ||
## || ||
## || Annotation : R data.frame ||
## || Dir for temp files : . ||
## || Threads : 10 ||
## || Level : meta-feature level ||
## || Paired-end : no ||
## || Multimapping reads : counted ||
## || Multi-overlapping reads : not counted ||
## || Min overlapping bases : 1 ||
## || ||
## \\===================== http://subread.sourceforge.net/ ======================//
##
## //================================= Running ==================================\\
## || ||
## || Load annotation file .Rsubread_UserProvidedAnnotation_pid126361 ... ||
## || Features : 59158 ||
## || Meta-features : 59158 ||
## || Chromosomes/contigs : 22 ||
## || ||
## || Process BAM file NL1_h3k27ac.rmdup.bam... ||
## || Single-end reads are included. ||
## || Assign alignments to features... ||
## || Total alignments : 7301635 ||
## || Successfully assigned alignments : 1381614 (18.9%) ||
## || Running time : 0.02 minutes ||
## || ||
## || Process BAM file HCC1_h3k27ac.rmdup.bam... ||
## || Single-end reads are included. ||
## || Assign alignments to features... ||
## || Total alignments : 6815573 ||
## || Successfully assigned alignments : 1704109 (25.0%) ||
## || Running time : 0.02 minutes ||
## || ||
## || Process BAM file HCC3_h3k27ac.rmdup.bam... ||
## || Single-end reads are included. ||
## || Assign alignments to features... ||
## || Total alignments : 14425593 ||
## || Successfully assigned alignments : 2443975 (16.9%) ||
## || Running time : 0.04 minutes ||
## || ||
## || ||
## \\===================== http://subread.sourceforge.net/ ======================//
# the output of Rsubread::featureCounts() function is a list
names(reads)
## [1] "counts" "annotation" "targets" "stat"
counts <- reads$counts
colnames(counts) <- c("NL1", "HCC1", "HCC3") # Set colnames for samples
head(counts) # This is the matrix we need for DESeq2
## NL1 HCC1 HCC3
## region_1 14 7 28
## region_2 6 0 1
## region_3 32 30 40
## region_4 3 6 6
## region_5 142 184 142
## region_6 8 14 20
# Save file for using afterwards
dir.create("data/differentialAnalysis/", F)
save(counts, file="data/differentialAnalysis/countMatrix.rda")
In this step we will need to add information that we feel is important for the differential analysis. The important part is that the rownames of the dataframe generated in this step, need to coincide with the colnames of the count matrix generated in the step above.
coldata <- data.frame(id=c("NL1", "HCC1", "HCC3"),
tissue=c("NormalLiver", "HCCarcinoma", "HCCarcinoma"),
patient=c(0, 1, 3))
rownames(coldata) <- coldata$id
# To use normal as reference we have to convert to factor and establish levels
coldata$tissue <- factor(coldata$tissue,
levels=c("NormalLiver", "HCCarcinoma"))
coldata
## id tissue patient
## NL1 NL1 NormalLiver 0
## HCC1 HCC1 HCCarcinoma 1
## HCC3 HCC3 HCCarcinoma 3
library(DESeq2)
# Create DESeq dataset
load("data/differentialAnalysis/countMatrix.rda") # Load count matrix
dds <- DESeqDataSetFromMatrix(countData = counts,
colData = coldata,
design = ~ tissue)
# Now run DESeq2 differential analysis
dds <- DESeq(dds)
res <- results(dds)
res
## log2 fold change (MLE): tissue HCCarcinoma vs NormalLiver
## Wald test p-value: tissue HCCarcinoma vs NormalLiver
## DataFrame with 59158 rows and 6 columns
## baseMean log2FoldChange lfcSE
## <numeric> <numeric> <numeric>
## region_1 15.6947072474895 -0.215659211495587 1.48730676118036
## region_2 2.74341012392512 -4.05403875203834 3.50288432831692
## region_3 34.7991152178152 -0.297191790295466 1.04345637092459
## region_4 5.03113662096046 0.606260020730526 2.40813001468799
## region_5 164.013384879574 -0.161398128579648 0.716750596933899
## ... ... ... ...
## region_59154 124.501202367307 0.593777658949546 0.764885797280492
## region_59155 7.98818584357692 0.527646936478058 1.99525471559366
## region_59156 10.9064408578207 -0.0431198215346853 1.70900584195298
## region_59157 4.68144684032319 -1.69069693218505 2.44414776588358
## region_59158 4.63837369744355 -1.20456314505855 2.4422971198413
## stat pvalue padj
## <numeric> <numeric> <numeric>
## region_1 -0.144999819219833 0.88471101452346 NA
## region_2 -1.15734302707799 0.247132240925086 NA
## region_3 -0.284814773838726 0.775786052926103 NA
## region_4 0.251755518611016 0.801230041049858 NA
## region_5 -0.225180319723588 0.821838998898976 0.999850732091536
## ... ... ... ...
## region_59154 0.776295835353054 0.437574331510433 0.999850732091536
## region_59155 0.264450915642148 0.791432482440711 NA
## region_59156 -0.0252309386405665 0.979870759340106 NA
## region_59157 -0.691732699546437 0.489105207816006 NA
## region_59158 -0.493209092076735 0.621864854234158 NA
# Filter significant regions: padj =< 0.05
res <- as.data.frame(res)
res.sign <- res[!is.na(res$padj) & res$padj <= 0.1,]
knitr::kable(res.sign)
| baseMean | log2FoldChange | lfcSE | stat | pvalue | padj | |
|---|---|---|---|---|---|---|
| region_27434 | 72.43625 | 5.438860 | 1.3370677 | 4.067753 | 0.0000475 | 0.0687743 |
| region_32134 | 87.05616 | 3.663612 | 0.9397686 | 3.898419 | 0.0000968 | 0.0701401 |
| region_39959 | 161.82096 | 3.070594 | 0.7837174 | 3.917987 | 0.0000893 | 0.0701401 |
| region_45908 | 157.74925 | 3.028650 | 0.7392091 | 4.097149 | 0.0000418 | 0.0687743 |
| region_48480 | 120.87701 | 3.209147 | 0.8077973 | 3.972713 | 0.0000711 | 0.0701401 |
| region_51287 | 92.95481 | 3.959399 | 0.9251722 | 4.279635 | 0.0000187 | 0.0457918 |
| region_51448 | 92.59662 | 3.956731 | 1.0373151 | 3.814396 | 0.0001365 | 0.0847669 |
| region_52386 | 160.94085 | 2.993902 | 0.7635791 | 3.920880 | 0.0000882 | 0.0701401 |
| region_53429 | 108.53384 | 3.816349 | 0.8749689 | 4.361696 | 0.0000129 | 0.0457918 |
| region_53431 | 174.52332 | 3.488785 | 0.7505037 | 4.648591 | 0.0000033 | 0.0290529 |
| region_53434 | 121.37671 | 3.431030 | 0.8066877 | 4.253232 | 0.0000211 | 0.0457918 |
| region_53828 | 287.99343 | -2.331232 | 0.6188372 | -3.767117 | 0.0001651 | 0.0957061 |
| region_55168 | 61.29972 | 5.188491 | 1.3381817 | 3.877270 | 0.0001056 | 0.0706375 |
| region_55169 | 64.09974 | 8.716567 | 2.2067034 | 3.950040 | 0.0000781 | 0.0701401 |
| region_55758 | 74.40073 | 4.454553 | 1.1154266 | 3.993587 | 0.0000651 | 0.0701401 |
DESeq2 documentation is pretty extensive and with lots of examples, you can find it here.